Efficient Web Data Mining with Standard XML Technologies

Authors

  • P. N. Santosh Kumar
  • Sunil Kumar
Abstract

The problem of Web data extraction and an XML-based methodology whose goal extends far beyond simple “screen scraping” are discussed. An ideal data extraction process is able to digest target Web databases that are visible only as HTML pages, and create a local, identical replica of those databases as a result. What is needed in this process is much more than a Web crawler and a set of Web site wrappers. A comprehensive data extraction process needs to deal with roadblocks such as session identifiers, HTML forms, and client-side JavaScript, as well as data integration problems such as incompatible datasets and vocabularies, and missing and conflicting data. Proper data extraction also requires a solid data validation and error recovery service to handle data extraction failures, which are unavoidable. In this paper we describe NDES, a software framework that makes significant advances in solving these problems and provides a platform for building a production-quality Web data extraction process. Key aspects of NDES are that it uses XML technologies for data extraction, including XHTML and XSLT, and provides access to the “deep Web.”
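To make the idea of extracting records from XHTML with standard XML tooling concrete, here is a minimal sketch. It is not NDES code, which the abstract does not show. The abstract names XSLT, but the Python standard library has no XSLT processor, so this sketch substitutes the XPath subset in `xml.etree.ElementTree`. The sample page, the `listings` table id, and the `extract_records` function are all hypothetical. The page is assumed to have already been normalized to well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace; "x" is an arbitrary local prefix.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical page, assumed to be already converted to well-formed XHTML.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table id="listings">
      <tr><th>Title</th><th>Price</th></tr>
      <tr><td>Widget</td><td>9.50</td></tr>
      <tr><td>Gadget</td><td></td></tr>
    </table>
  </body>
</html>"""


def extract_records(xhtml_text):
    """Return (records, errors) pulled from the hypothetical listings table."""
    root = ET.fromstring(xhtml_text)
    records, errors = [], []
    rows = root.findall(".//x:table[@id='listings']/x:tr", XHTML_NS)
    for row in rows:
        cells = row.findall("x:td", XHTML_NS)
        if not cells:
            # Header rows contain only <th> cells; skip them.
            continue
        title = (cells[0].text or "").strip()
        price_text = (cells[1].text or "").strip() if len(cells) > 1 else ""
        try:
            records.append({"title": title, "price": float(price_text)})
        except ValueError:
            # Simple validation step: record the failure instead of aborting.
            errors.append({"title": title, "reason": "missing or invalid price"})
    return records, errors


records, errors = extract_records(SAMPLE_PAGE)
```

Returning failed rows separately, instead of raising on the first bad row, mirrors the abstract's point that extraction failures are unavoidable and need a validation and recovery path.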

Similar resources

An Improved Web Mining Technique to Fetch Web Data Using Apriori and Decision Tree

The World Wide Web is the largest source of information. Most of the data on the web is dynamic and unstructured, and it is becoming difficult to get relevant data from the web. Data mining is the field of computer science that is used to extract knowledge from very large amounts of data. Web mining is the application of data mining, which implements various techniques of data mining to ...

Partitions musicales et technologies web

This paper shows that new web technologies such as SVG, DOM, AJAX and CSS are now mature enough to allow browsing of musical scores with optimal graphical and ergonomic quality, together with powerful standard XML data-mining tools. Keywords: AJAX, DOM, DTD, CSS, musical scores, MusicXML, SAX, SVG, web.

A Framework For Extracting Information From Web Using VTD-XML’s XPath

The exponential growth of the WWW (World Wide Web) has produced a vast pool of information as well as several challenges, such as extracting potentially useful and unknown information from it. Many websites are built with HTML, and because of its unstructured layout it is difficult to obtain effective and precise data from the web using HTML. The advent of XML (Extensible Markup Language) p...

The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme

This paper describes the rationale and design of TXM, a text-mining analysis platform compatible with XML-TEI encoded corpora. The design of this platform is based on a synthesis of the best available algorithms in existing textometry software. It also relies on identifying the most relevant open-source technologies for processing textual resources encoded in XML and Unicode, for efficien...

Publication date: 2017